Sentence-level dialects identification in the greater China region

Authors

  • Fan Xu
  • Mingwen Wang
  • Maoxi Li
Abstract

Identifying different varieties of the same language is more challenging than identifying unrelated languages. In this paper, we propose an approach to discriminating language varieties or dialects of Mandarin Chinese for Mainland China, Hong Kong, Taiwan, Macao, Malaysia and Singapore, a.k.a. the Greater China Region (GCR). For dialect identification in the GCR, we find that the commonly used character-level or word-level unigram features are not very efficient, because of several specific problems such as the ambiguity and context dependence of words in the dialects of the GCR. To overcome these challenges, we use not only general features such as character-level n-grams, but also many new word-level features, including PMI-based and word-alignment-based features. A series of evaluation results on both a news dataset and an open-domain dataset from Wikipedia shows the effectiveness of the proposed approach.
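As a rough illustration of two of the feature families named in the abstract, the sketch below extracts character-level n-grams and computes pointwise mutual information (PMI) between a word and a region label. The abstract does not define the PMI feature precisely, so scoring word/region association as log2 of p(w, r) / (p(w) p(r)) is an assumption here, and the function names `char_ngrams` and `pmi_table` and the two toy sentences are hypothetical, not taken from the paper.

```python
import math
from collections import Counter


def char_ngrams(sentence, n):
    """Return all overlapping character n-grams of a sentence."""
    return [sentence[i:i + n] for i in range(len(sentence) - n + 1)]


def pmi_table(labeled_sentences):
    """Compute PMI(token, label) from (tokens, label) pairs.

    Assumption: PMI is taken between a token and a region label,
    estimated from raw token counts. The paper may define it differently.
    """
    token_counts = Counter()
    label_counts = Counter()
    joint_counts = Counter()
    total = 0
    for tokens, label in labeled_sentences:
        for tok in tokens:
            token_counts[tok] += 1
            label_counts[label] += 1
            joint_counts[(tok, label)] += 1
            total += 1
    pmi = {}
    for (tok, label), joint in joint_counts.items():
        # log2( p(tok, label) / (p(tok) * p(label)) )
        ratio = joint * total / (token_counts[tok] * label_counts[label])
        pmi[(tok, label)] = math.log2(ratio)
    return pmi


# Toy, pre-segmented sentences (illustrative only): "I take a taxi".
toy_data = [
    (["我", "坐", "出租车"], "Mainland"),
    (["我", "坐", "計程車"], "Taiwan"),
]

if __name__ == "__main__":
    print(char_ngrams("我坐出租车", 2))
    for key, value in sorted(pmi_table(toy_data).items()):
        print(key, round(value, 3))
```

On this toy input, the region-specific words for "taxi" receive a PMI of 1 bit with their own region, while the words shared by both sentences score 0. This shows why an association score of this kind can separate dialect-bearing words from common vocabulary that plain unigram counts treat alike.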

Similar resources

Declarative sentence intonation patterns in 8 Swiss German dialects

This study examines declarative sentence intonation contours in 8 vastly different Swiss German dialects by the application of the Command-Response model. Fundamental frequency patterns of a controlled declarative sentence are analyzed on the global and local level of intonation. The results provide evidence of a different patterning for the dialects in the context of how global and local level...

Hierarchical Classification for Spoken Arabic Dialect Identification using Prosody: Case of Algerian Dialects

In daily communications, Arabs use local dialects which are hard to identify automatically using conventional classification methods. The challenging task of dialect identification becomes more complicated when dealing with under-resourced dialects belonging to the same county/region. In this paper, we start by analyzing statistically Algerian dialects in order to capture their specificities rela...

Arabic Dialect Identification

The written form of the Arabic language, Modern Standard Arabic (MSA), differs in a nontrivial manner from the various spoken regional dialects of Arabic – the true “native languages” of Arabic speakers. Those dialects, in turn, differ quite a bit from each other. However, due to MSA’s prevalence in written form, almost all Arabic datasets have predominantly MSA content. In this article, we des...

Manifestation of downstep and intonation in Japanese: comparison of the Tokyo and Kochi dialects

This paper examines the manifestation of downstep and intonation in the Tokyo and Kochi dialects of Japanese by using three types of syntactically balanced material: adjective phrases, adverbial phrases, and sentence modifiers. The main conclusion is that Kochi speakers produce a smaller Major Phrase consisting of fewer lexical accents than in the Tokyo dialect, the Major Phrase being defined as...

The Use of Perception Tests in Studying the Tonal System of Prinmi Dialects: A Speaker-centered Approach to Descriptive Linguistics

Contrary to previous description based on the Mandarin model of syllable-tone system, Xinyingpan, a dialect of Prinmi (a Tibeto-Burman language of China), has been discovered to possess a melody-tone system (or “pitch-accent” system) akin to that of Japanese. Targeting the crux of the unusual characteristics of this melody-tone system, where neutralization of two tonal categories in citation fo...

Journal title:
  • CoRR

Volume abs/1701.01908

Publication date 2016